Quantitative Text Analysis of ‘The Guardian’ headlines

Executive summary

This report presents a quantitative text analysis of headlines published by “The Guardian” between 18 October 2021 and 28 November 2021. It applies sentiment analysis to the coverage surrounding the COP26 negotiations, which took place in Glasgow from 31 October to 12 November 2021. This descriptive analysis aims to detect possible changes in the opinions and/or political positioning of the outlet over that period. ADD CONCLUSIONS

Statement of contributions

  • Laura
  • Kat
  • Nassim

Introduction

The 26th Conference of the Parties (COP26), held in Glasgow, was a crucial moment for climate policy negotiations.

Most of the world’s most influential leaders attended the event to discuss future global action on climate mitigation and adaptation, together with non-state actors and internationally renowned personalities. The occasion gained substantial media attention from all over the world, with peaks just before the start of the COP (the so-called ‘PreCOP’ events), during the Conference itself, and just after its conclusion.

However, media outlets approach climate change in different ways that reflect their political positioning: the headlines, the highlights and the frequently mentioned topics all differ accordingly.

We collect the headlines of a British newspaper to analyse possible trends and changes in sentiment from the weeks just before the Conference to the period just after it.

Motivation

As public policy students concerned about climate negotiations, we are interested in investigating the opinions and attitudes expressed by media outlets in the periods mentioned above. We are mainly interested in whether the standpoint and perspective of media outlets changed over time, and in how any such change developed.

The relevance of our analysis lies in our curiosity about how newspapers behave around critical international occasions such as COP26. Understanding whether they inform people while adding a particular sentiment (which could also reflect a general feeling among readers), or whether they prefer to remain neutral and report factual events objectively, could foster a deeper comprehension of the role of information and media in climate change developments.

To accomplish that, we decided to analyse the sentiment of a single media outlet published in the COP26 host country, the UK: The Guardian. According to YouGov findings, this newspaper is considered left-leaning. Topics such as climate finance were among the most critical on the table at the COP26 negotiations. Choosing a non-neutral outlet, one likely to have endorsed such topics, may therefore show more compelling changes in political positioning in response to the outcomes of the Conference, changes that may in turn be reflected in the headlines.

Research question

The principal objective of this research is to analyse trends in the attitude of the newspaper’s headlines towards COP26 topics. The related questions fall into two macro areas. The first addresses our original and main interest: whether the ratio of positive to negative words changes over time and, if so, in which period. Questions related to this area are:

  • Do the words used in the headlines of ‘The Guardian’ convey a specific sentiment?
  • What is the ratio of positive to negative words in the collected data? Does this ratio change over time?
  • Overall, is this sentiment more positive or negative? Which are the most positive and negative words?
  • Are there any possible interesting patterns among the most frequent words that could be inspected further?
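The ratio question above can be made concrete with a small sketch. It is written in Python purely for illustration (the analysis in this report was carried out in R), and the word lists and headlines are hypothetical stand-ins, not entries from any published dictionary or from the scraped data. It counts positive and negative words per ISO week and reports their ratio.

```python
from collections import defaultdict
from datetime import date

# Hypothetical mini-lexicon; NOT taken from any published dictionary.
POSITIVE = {"hope", "progress", "win"}
NEGATIVE = {"fail", "threat", "risk"}

# Hypothetical (date, headline) pairs standing in for the scraped data.
headlines = [
    (date(2021, 10, 25), "hope for progress before summit"),
    (date(2021, 10, 27), "leaders risk failure, a threat to talks"),
    (date(2021, 11, 8), "talks fail as risk grows"),
    (date(2021, 11, 10), "small win for negotiators"),
]


def weekly_ratio(rows):
    """Return {ISO week: positive/negative ratio}; None when a week has no negative words."""
    counts = defaultdict(lambda: [0, 0])  # week -> [positive count, negative count]
    for day, text in rows:
        week = day.isocalendar()[1]
        for word in text.lower().split():
            word = word.strip(",.")
            if word in POSITIVE:
                counts[week][0] += 1
            elif word in NEGATIVE:
                counts[week][1] += 1
    return {
        week: (pos / neg if neg else None)
        for week, (pos, neg) in sorted(counts.items())
    }


ratios = weekly_ratio(headlines)
for week, ratio in ratios.items():
    print(week, ratio)
```

Grouping by week rather than by day is an assumption made here to smooth out days with very few headlines; any other time bucket would work the same way.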

The second area concerns a more specific analysis that also takes into account how such results could change when using different measurement instruments, in this case, dictionaries. Questions related to this area are:

  • Is the sentiment analysis consistent across different dictionaries?
  • Do any differences and/or overlaps suggest a political interpretation? Are the results relevant for political interpretation?
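One simple way to think about consistency across dictionaries is to label each headline with each dictionary and then measure how often the labels coincide. The Python sketch below is only an illustration of that idea, not the procedure used in this report; the two label lists are hypothetical.

```python
from collections import Counter

# Hypothetical per-headline labels produced by two different dictionaries.
labels_a = ["negative", "positive", "neutral", "negative", "positive"]
labels_b = ["negative", "positive", "negative", "negative", "neutral"]


def compare(a, b):
    """Return (share of headlines with identical labels, cross-tabulation of label pairs)."""
    if len(a) != len(b):
        raise ValueError("both label lists must cover the same headlines")
    crosstab = Counter(zip(a, b))
    agreement = sum(n for (x, y), n in crosstab.items() if x == y) / len(a)
    return agreement, crosstab


agreement, crosstab = compare(labels_a, labels_b)
print(f"agreement: {agreement:.2f}")
for pair, n in crosstab.items():
    print(pair, n)
```

The cross-tabulation shows where the dictionaries diverge (for example, headlines one dictionary calls neutral and the other negative), which is more informative than the agreement share alone.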

Methods

The text analysis is performed mainly with two R packages: tidytext and quanteda. tidytext follows the tidyverse design philosophy and works with text stored as character columns in data frames, whereas quanteda works with Corpus objects, as is typical of NLP workflows. We employed both tools to address each of our research questions in the most appropriate way. Specifically, tidytext made it simpler to build analyses and visualizations with dates than quanteda’s document-level variables would have. The quanteda package was instead particularly useful for the targeted sentiment analysis we conducted, and it allowed us to check the consistency of the results against another dictionary, LSD2015.

Limitations

As a newspaper from the host country of the climate negotiations, The Guardian does not provide an ideal sample of headlines from which to infer, through sentiment analysis, whether COP26 met expectations. The results only show changes in opinion for the specific political leaning that this outlet represents. However, the aim of this project is to apply sentiment analysis procedures to information scraped from the web and to present the results to the user in an accessible format. It is therefore necessary to acknowledge the very limited scope of this analysis: its findings apply only to this specific and small sample.

A further limitation concerns the dates scraped from The Guardian website. With the web-scraping strategy used, the most recent dates (December and the end of November 2021) contain some missing values caused by heterogeneous formats across the website’s pages. For demonstration purposes we simply dropped those missing values, which further limits the scope of the analysis.
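To illustrate how heterogeneous date formats lead to missing values that are then dropped, here is a minimal Python sketch. It is not the scraping code used for this report (that is written in R), and both the expected format and the sample strings are assumptions made for the example.

```python
from datetime import date, datetime

# Assumed date format for the example, e.g. "1 November 2021".
# Note: %B matches month names in the current locale (English assumed here).
EXPECTED_FORMAT = "%d %B %Y"


def parse_date(raw):
    """Return a date if the string matches the expected format, otherwise None."""
    if not raw:
        return None
    try:
        return datetime.strptime(raw.strip(), EXPECTED_FORMAT).date()
    except ValueError:
        return None


# Hypothetical scraped rows: (raw date string, headline).
rows = [
    ("1 November 2021", "Headline A"),
    ("Mon 29 Nov 2021 10.00 GMT", "Headline B"),  # different layout, fails to parse
    ("", "Headline C"),                           # date missing on the page
]

parsed = [(parse_date(raw), headline) for raw, headline in rows]
kept = [(day, headline) for day, headline in parsed if day is not None]
print(f"kept {len(kept)} of {len(rows)} rows")
```

Dropping unparsed rows is the simplest option; adding further formats to try would recover more headlines at the cost of extra code.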

Retrieving the data

The webscraping, cleaning and formatting section of the analysis can be found in the R script scraping_and_data_cleaning that is available in the repository.

The webscraping strategy adopted consists of downloading the headlines from multiple pages of the newspaper website by date (static webscraping). The formatting step converts dates into the correct format with lubridate and prepares the data for the quantitative text analysis with tidytext. In this step, words describing the main topic of the headlines (“cop26”, “glasgow”, “climate”, “change”) were expected to be very frequent without contributing to a specific sentiment, so they were removed as stopwords.
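The stopword step can be sketched as follows. This is a Python illustration of the idea, not the R code in the repository; the four custom stopwords come from the text above, while the generic stopword list is a small illustrative subset and the sample headline is invented.

```python
import re

# Topic words removed as custom stopwords (from the text above).
CUSTOM_STOPWORDS = {"cop26", "glasgow", "climate", "change"}
# Small illustrative subset of generic English stopwords (an assumption).
GENERIC_STOPWORDS = {"the", "a", "of", "to", "in", "and", "at", "on", "for"}
STOPWORDS = CUSTOM_STOPWORDS | GENERIC_STOPWORDS


def tokenize(headline):
    """Lowercase a headline, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", headline.lower())
    return [t for t in tokens if t not in STOPWORDS]


tokens = tokenize("Cop26: the climate crisis talks open in Glasgow")
print(tokens)
```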

Explorative analysis

Through the exploration of the collected data, we aim at understanding which are the most frequent words and whether they could have a role in our investigation.

Using a frequency table and an exploratory word cloud, we visualize the most frequent words. Apart from the custom stopwords, ‘crisis’ is the most frequent word used in the headlines during the COP26 period.
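A frequency table of this kind amounts to counting tokens and sorting by count. The short Python sketch below shows the idea on a hypothetical token list; it is not the code that produced the table in this report.

```python
from collections import Counter

# Hypothetical tokens, already lowercased and with stopwords removed.
tokens = ["crisis", "world", "crisis", "net", "world", "crisis"]

frequency = Counter(tokens)
top_words = frequency.most_common(2)  # the two most frequent words with their counts

for word, n in top_words:
    print(word, n)
```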

A keyword-in-context table lets us check quickly whether ‘crisis’ ever appears outside the bigram ‘climate crisis’. It does not. Since the main topic of COP26 is precisely ‘tackling the climate crisis’, this word carries no relevant information despite clearly indicating a negative sentiment. It is therefore dropped.
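The keyword-in-context check can be sketched as below. This is a Python illustration under stated assumptions: the helper kwic and the two sample headlines are hypothetical, and the report itself relied on R tooling for this step.

```python
def kwic(headlines, keyword, window=2):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    rows = []
    for headline in headlines:
        tokens = headline.lower().split()
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                rows.append((left, token, right))
    return rows


# Hypothetical headlines, for illustration only.
sample = [
    "World leaders face climate crisis at summit",
    "Climate crisis talks stall",
]

rows = kwic(sample, "crisis")
# True when every occurrence is immediately preceded by "climate".
always_climate_crisis = all(left.split()[-1:] == ["climate"] for left, _, _ in rows)

for row in rows:
    print(row)
print(always_climate_crisis)
```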

WordCloud

Comment on wordcloud

Word        Count
crisis      58
net         52
world       50
video       48
australia   41
johnson     40
happened    38
global      34
boris       33
emissions   32

Keywords in context for the most frequent word

Sentiment analysis

The sentiment analysis of the collected headlines uses a dictionary-based method. The three dictionaries used are:

  • ‘Bing et al.’
  • ‘AFINN’
  • ‘Lexicoder Sentiment Dictionary’ (LSD2015)

These dictionaries were chosen mainly on the basis of common practice and of our research objective: to measure the sentiment of the headlines around the climate negotiations, quantify it, detect any potential patterns, and check the consistency of the results.

From the tidytext package, we use the ‘Bing et al.’ and the ‘AFINN’ dictionaries. These are general-purpose lexicons based on unigrams (single words). The first classifies words as either negative or positive, while the second assigns each word an integer score between -5 (very negative) and +5 (very positive).
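The difference between the two scoring schemes can be illustrated with a toy example. The Python sketch below uses two tiny hypothetical lexicons whose entries and values are invented for the example and are not the actual contents of ‘Bing et al.’ or ‘AFINN’; the real analysis was run in R.

```python
from collections import Counter

# Hypothetical binary lexicon in the style of a positive/negative classification.
BINARY_LEXICON = {"hope": "positive", "crisis": "negative", "fail": "negative"}
# Hypothetical scaled lexicon with integer scores between -5 and +5.
SCALED_LEXICON = {"hope": 2, "crisis": -3, "fail": -2, "catastrophic": -4}


def binary_counts(tokens):
    """Count how many tokens are labeled positive and how many negative."""
    return Counter(BINARY_LEXICON[t] for t in tokens if t in BINARY_LEXICON)


def scaled_score(tokens):
    """Sum the integer scores of all tokens found in the scaled lexicon."""
    return sum(SCALED_LEXICON.get(t, 0) for t in tokens)


tokens = ["catastrophic", "fail", "hope"]
counts = binary_counts(tokens)
score = scaled_score(tokens)
print(dict(counts), score)
```

In this toy case the binary lexicon sees one positive and one negative word, while the scaled lexicon returns a clearly negative total, both because it weights words by intensity and because the two lexicons cover different vocabularies.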

From the quanteda package, the Lexicoder Sentiment Dictionary is a valid alternative, given its suitability for sentiment analysis of political communication (Young & Soroka, 2012). The dictionary consists of 2,858 “negative” sentiment words and 1,709 “positive” sentiment words. The novelty of Young and Soroka’s approach lies in a further set of 2,860 negations of negative words and 1,721 negations of positive words. However, we did not find this additional set useful for our research purposes.

Bing et al. visualization

Sentiment frequency

Most Frequently Occurring Words per Sentiment

All Sentiment Interactive Plot by Date

Positive vs Negative Sentiment Interactive Plots by Date

AFINN Visualization

AFINN analysis for all headlines

AFINN analysis for Boris Johnson headlines

LSD2015 Visualization

Distribution of Sentiments

Concluding remarks

Further research suggestions

Resources